Introduction

This case study is part of my Google Data Analytics Capstone Project, where I’ll guide you through the six phases of Data Analytics: Ask, Prepare, Process, Analyze, Share and Act. Throughout these phases, I’ll demonstrate my ability to identify business tasks and key metrics, clean and process data for analysis, apply appropriate statistical techniques for data analysis and create interactive visualization using ggplotly. The final steps, including cleaning, analysis and visualization of data, will be executed using R Studio.

About the Company

In 2016, Cyclistic launched a bike-share program in Chicago, which has since expanded to include 5,824 bicycles across 692 stations. Cyclistic offers flexible pricing plans, attracting a wide range of customers. However, finance analysts have found that annual members are more profitable than casual riders. To boost future growth, the marketing team aims to convert casual riders into annual members. To achieve this, they need to understand the differences in bike usage between the two groups.

ASK

Business Task:

Identify marketing strategies to convert casual riders into annual members. To do this, it is important to analyze how annual members and casual riders used Cyclistic bikes differently using the previous 12 months Cyclistic’s historical data.

Metrics that could be used to compare the ride patterns between casual and member riders:

  • Usage by type of bike (Docked vs Classic vs Electric bikes)
  • Duration of the trip in minutes
  • Peak time or day of the week
  • Most popular start and end stations

The key stakeholders for this project include the Director of Marketing, the Cyclistic executive team, and the marketing analytics department.

PREPARE

Data Source

For this analysis, data from the past 12 months (August 2023 - July 2024) was utilized. This data was stored on AWS cloud. Each monthly dataset contains essential details, including membership status (casual or member), type of bike used (classic, docked & electric), trip start and end times, and information about the start and end stations, along with their latitude and longitude coordinates.

We will use the following R packages:

  • tidyverse
  • lubridate
  • dplyr
  • rstatix
  • ggplot2
  • gtsummary
  • ggplotly
  • sf

PROCESS

To begin processing and cleaning the data, we will first import and merge the data sets.

Step 1: Import and merge datasets

# Define the path to the directory containing the CSV files
path <- "C:/Users/Smrut/OneDrive/Documents/Raw Data/Data"
#path <- "C:/Data Analytics Portfolio/Cyclistic Case Study/Subset_data"

# Get a list of all CSV files in the directory
files <- list.files(path = path, pattern = "*.csv", full.names = TRUE)

# Use lapply to read all files into a list of data frames
data_list <- lapply(files, read.csv)

# Optionally, combine all data frames in the list into one data frame
combined_data <- do.call(rbind, data_list)

# Dimensions of the data frame
dim(combined_data)  
## [1] 5699639      13
# See the first 6 rows of the dataset
head(combined_data, n=6)
##            ride_id rideable_type          started_at            ended_at
## 1 011C1903BF4E2E28  classic_bike 2023-09-23 00:27:50 2023-09-23 00:33:27
## 2 87DB80E048A1BF9F  classic_bike 2023-09-02 09:26:43 2023-09-02 09:38:19
## 3 7C2EB7AF669066E3 electric_bike 2023-09-25 18:30:11 2023-09-25 18:41:39
## 4 57D197B010269CE3  classic_bike 2023-09-13 15:30:49 2023-09-13 15:39:18
## 5 8A2CEA7C8C8074D8  classic_bike 2023-09-18 15:58:58 2023-09-18 16:05:04
## 6 03F7044D1304CD58 electric_bike 2023-09-15 20:19:25 2023-09-15 20:30:27
##               start_station_name start_station_id
## 1    Halsted St & Wrightwood Ave     TA1309000061
## 2         Clark St & Drummond Pl     TA1307000142
## 3  Financial Pl & Ida B Wells Dr           SL-010
## 4         Clark St & Drummond Pl     TA1307000142
## 5    Halsted St & Wrightwood Ave     TA1309000061
## 6 Southport Ave & Wrightwood Ave     TA1307000113
##                 end_station_name end_station_id start_lat start_lng  end_lat
## 1 Sheffield Ave & Wellington Ave   TA1307000052  41.92914 -87.64908 41.93625
## 2     Racine Ave & Fullerton Ave   TA1306000026  41.93125 -87.64434 41.92557
## 3           Racine Ave & 15th St          13304  41.87506 -87.63314 41.86127
## 4       Racine Ave & Belmont Ave   TA1308000019  41.93125 -87.64434 41.93974
## 5     Racine Ave & Fullerton Ave   TA1306000026  41.92914 -87.64908 41.92557
## 6                                                41.92884 -87.66387 41.90000
##     end_lng member_casual
## 1 -87.65266        member
## 2 -87.65842        member
## 3 -87.65663        member
## 4 -87.65887        member
## 5 -87.65842        member
## 6 -87.64000        member
remove(files)
remove(data_list)

Step 2: Process and clean data using data wrangling techniques. We need to retain records with complete membership status and valid start and end stations, ensuring all variables are converted to their appropriate data types.

# install.packages("tidyverse")
# install.packages("lubridate")

# Load the packages
library(tidyverse)
library(lubridate)

trips_data <- combined_data %>% 
  filter(member_casual != '' & start_station_id != '' & end_station_id != '') %>% #excluding missing records with subscription status, start and end station details
  rename(bike_type = rideable_type) %>%  #rename column as bike_type
  mutate(bike_type = factor(bike_type, levels = c("docked_bike", "classic_bike", "electric_bike"),  labels = c("Docked bike", "Classic bike", "Electric bike")), #assigning value labels to bike_type
         member_casual = factor(member_casual, levels = c("member", "casual"),  labels = c("Member", "Casual")), #assigning value labels to member_casual
         started_at = ymd_hms(started_at), # converting char to date time
         ended_at = ymd_hms(ended_at)) %>%  # converting char to date time 
        filter(start_lat < 42.02 & end_lat < 42.02) #excluding latitude with top coordinates greater than 42.0 since these communities are outsider Chicago boundaries

# Remove the combines_data to free up disk space

rm(combined_data)

Step 3: Identify and remove duplicate rows based on the ride_id and start time to ensure data accuracy and consistency.

## Check for duplicates based on ride id (CC numbers which could be repeatedly used by a rider hence may not be unique per ride), start_station and start time
trips_data_1 <- trips_data[!duplicated(trips_data$ride_id , trips_data$started_at ), ] 

rm(trips_data)

Step 4: Calculate new variables for analysis: * Trip duration in mins (ended_at - started_at) * Time, Date, month and year from the start date

#install.packages("hms") to extract the time from start date time variable
library(hms)
trips_data_2 <- trips_data_1 %>% 
  mutate(duration_trip = as.numeric(difftime(ended_at, started_at,units = c("mins"))), #Trip duration as diff between (ended_at - started_at) 
         time_trip = as_hms(started_at), 
         day_trip = day(started_at), #extract date from start date
         month_trip = month(started_at, label = TRUE, abbr = FALSE ), #extract month from start date
         year_trip = year(started_at), #extract year from start date
         week_day_trip = wday(started_at,label = TRUE, abbr = FALSE )) #extract weekday from start date
  
library(ggplot2)
library(scales)
# Identifying outliers in duration using a histogram
ggplot(trips_data_2, aes(x=duration_trip)) +
   ggtitle("Histogram plot for trip duration with outliers") + 
  xlab ("Trip duration in mins") +
  geom_histogram(binwidth=1, color="black", fill="white") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

rm(trips_data_1)

Step 5: Removing extreme outliers and arriving at the final dataset for analysis

#install.packages("rstatix")
library(rstatix)
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
## 
##     filter
outliers<- trips_data_2 %>%
  identify_outliers(duration_trip)  #Identifying extreme outliers

outliers <- outliers[c("ride_id","is.extreme")] #Keeping column with ride_ids and extreme outliers

trips_data_3 <- merge(x = trips_data_2, y = outliers, by = "ride_id", all.x = TRUE) #Merging using left join 

#Final dataset for analysis
trips_data_4 <- trips_data_3 %>% 
  mutate(is.extreme = ifelse(is.na(is.extreme), "FALSE", is.extreme)) %>% #replacing the missing values with FALSE
  filter(!duration_trip <= 1 & # Excluding rides with a duration of less than 1 minutes or duration in negative, as it is assumed these rides were not completed or the date times were not correctly entered
           is.extreme %in% c("FALSE")) # Excluding the extreme outliers assuming data processing errors

# Histogram without outliers
ggplot(trips_data_4, aes(x=duration_trip)) +
   ggtitle("Histogram plot for trip duration without extreme outliers") + 
  xlab ("Trip duration in mins") +
  geom_histogram(binwidth=1, color="black", fill="white") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

rm(outliers)
rm(trips_data_2)
rm(trips_data_3)

Analyze

Step 5: Analyse the data

The following analysis was conducted to observe the ride patterns between member vs casual riders:

  • The summary of all rides
  • Compare the distribution of type of bike between member and casual riders
  • Compare the duration of the trips taken by the member vs casual riders
  • Compare the average ride duration by each day of the week for members vs casual users
  • Compare the number of rides by each day of the week and time of the day for members vs casual users
  • Most frequent start and end stations

Table 1: Summary of all ride characteristics

#install.packages("gtsummary")
library(gtsummary)

trips_data_4 %>%  
  select(bike_type, member_casual, duration_trip, day_trip, month_trip, week_day_trip) %>% #selecting vars for summary
  tbl_summary(
    label = list(bike_type = "Type of bike", member_casual = "Subscription status", 
                 duration_trip = "Trip duration (mins)", 'day_trip' = "Day of the trip", 
                 'month_trip' = "Month of the trip", 'week_day_trip' = "Week of the trip",
                 'year_trip' = "Year of the trip"), #assigning variable labels to display
    type = all_continuous() ~ "continuous2", #summary stat to continue on another row
    statistic = list(all_continuous() ~ c("{mean} ({sd})", "{median} ({p25} - {p75})","{min} - {max}")), #stats
    digits = (list(all_categorical() ~ c(0, 1))) #display percentage with 1 decimal
  )
Characteristic N = 3,966,7871
Type of bike
    Docked bike 0 (0.0%)
    Classic bike 2,640,104 (66.6%)
    Electric bike 1,326,683 (33.4%)
Subscription status
    Member 2,637,277 (66.5%)
    Casual 1,329,510 (33.5%)
Trip duration (mins)
    Mean (SD) 13 (10)
    Median (Q1 - Q3) 10 (6 - 17)
    Min - Max 1 - 55
Day of the trip
    Mean (SD) 16 (9)
    Median (Q1 - Q3) 15 (8 - 23)
    Min - Max 1 - 31
Month of the trip
    January 108,627 (2.7%)
    February 177,318 (4.5%)
    March 219,617 (5.5%)
    April 279,705 (7.1%)
    May 411,192 (10.4%)
    June 466,165 (11.8%)
    July 508,860 (12.8%)
    August 512,251 (12.9%)
    September 475,586 (12.0%)
    October 383,021 (9.7%)
    November 263,419 (6.6%)
    December 161,026 (4.1%)
Week of the trip
    Sunday 512,155 (12.9%)
    Monday 527,302 (13.3%)
    Tuesday 557,795 (14.1%)
    Wednesday 602,352 (15.2%)
    Thursday 579,634 (14.6%)
    Friday 572,646 (14.4%)
    Saturday 614,903 (15.5%)
1 n (%)
#Pie Chart to show the distribution of member and casual riders
# Proportion table for grouped bar chart

tab1 <- table(trips_data_4$member_casual) #frequency table
prop1 <- as.data.frame(prop.table(tab1)) #proportion with denominator as member_casual var
prop1$percent <- as.numeric(format(round(prop1$Freq*100,1),nsmall=1)) # percentage calculation
colnames(prop1) <- c("member_casual","prop","percent") #renaming col names

# Pie chart
p1 <- ggplot(data=prop1, aes(x="", y=percent, fill=member_casual)) + 
geom_col(color = "black") +
  coord_polar(theta = "y")+ 
geom_text(aes(label = percent), size = 3, position = position_stack(vjust = 0.5), vjust = -1, color = "black") 

p1 +  theme_void() + scale_fill_manual(name = "Subscription status", values=c("#136759","#8fed49")) 

# Calculate the total trip duration

trips_data_4 %>%  
  select(member_casual, duration_trip) %>% #selecting vars for summary
  summarise(sum_duration = sum(duration_trip)
  )
##   sum_duration
## 1     51900904
# Calculate the total trip duration by subscription status

trips_data_4 %>%  
  select(member_casual, duration_trip) %>% #selecting vars for summary
  group_by(member_casual) %>% 
  summarise(sum_duration = sum(duration_trip)
  )
## # A tibble: 2 × 2
##   member_casual sum_duration
##   <fct>                <dbl>
## 1 Member           30609442.
## 2 Casual           21291462.

Table 2: Comparison of the Type of Bike Between Members and Casual Riders

trips_data_4 %>%  
  select(bike_type, member_casual) %>% #selecting vars for summary
  tbl_summary(
    by = member_casual, #grouping variable
    label = list(bike_type = "Type of bike", member_casual = "Subscription status"), #assigning variable labels 
    digits = list(all_categorical() ~ c(0, 1)), #display percentage with 1 decimal
  ) %>% 
  add_p() # Add chi-square/fishers exact p value
## The following errors were returned during `add_p()`:
## ✖ For variable `bike_type` (`member_casual`) and "p.value" statistic: The
##   package "cardx" (>= 0.2.1) is required.
Characteristic Member
N = 2,637,277
1
Casual
N = 1,329,510
1
p-value
Type of bike


    Docked bike 0 (0.0%) 0 (0.0%)
    Classic bike 1,786,184 (67.7%) 853,920 (64.2%)
    Electric bike 851,093 (32.3%) 475,590 (35.8%)
1 n (%)
# Proportion table for grouped bar chart

tab1 <- table(trips_data_4$member_casual, trips_data_4$bike_type) #frequency table
prop1 <- as.data.frame(prop.table(tab1, margin = 1)) #proportion with denominator as member_casual var
prop1$percent <- as.numeric(format(round(prop1$Freq*100,1),nsmall=1)) # percentage calculation
colnames(prop1) <- c("member_casual","bike_type","prop","percent") #renaming col names

# Grouped bar chart
p <- ggplot(data=prop1, aes(x=bike_type, y=percent, fill=member_casual)) + 
geom_bar(stat="identity", position=position_dodge()) +
  ylim (0,100) + 
  xlab("Type of the bike") + 
  ylab ("Percentage") +
  ggtitle("Distribution of usage by Bike Type") +
geom_text(aes(label = percent), size = 3, position = position_dodge(width = .9), vjust = -1, color = "black") 

p + theme_classic() +  scale_fill_manual(name = "Subscription status", values=c("#136759","#8fed49")) + 
  annotate(geom = "text", x=3, y=90, label = "Pearson’s Chi-squared test \n p<0.001")

Table 3: Comparison of Trip Duration Between Members and Casual Riders

# Histogram to check the distribution between member vs casual riders
p<- ggplot(trips_data_4, aes(x=duration_trip,fill=member_casual, color = member_casual)) +
  geom_histogram(alpha = 0.5, position="identity", binwidth=1) +
   theme_classic() +
  labs(title = "Distribution plot of Trip Duration (mins)") +
  xlab ("Trip duration (mins)") +
  ylab ("Count")
#p2 + labs(fill = "Subscription status",
 #        color = "Subscription status")

p +  scale_fill_manual(name = "Subscription status", values=c("#136759","#8fed49")) + 
  scale_color_manual(name = "Subscription status", values=c("#136759","#8fed49")) + 
  scale_x_continuous(breaks = seq(0, 60, by = 10)) 

Since the distribution for both the groups seems to be positively skewed, we use box plot for graphical representation of the trip duration between member vs casual riders.

trips_data_4 %>%  
  select(duration_trip, member_casual) %>%  #selecting vars for summary
  tbl_summary(
    by = member_casual,  #grouping variable
    label = list(duration_trip = "Trip duration (mins)", member_casual = "Subscription status"),#assigning variable labels 
    type = all_continuous() ~ "continuous2", #each summary stat to continue on another row
    statistic = list(all_continuous() ~ c("{mean} ({sd})", "{median} ({p25} - {p75})","{min} - {max}"))
  )  %>% 
  add_p() # add p value for wilcoxon signed rank test
## The following errors were returned during `add_p()`:
## ✖ For variable `duration_trip` (`member_casual`) and "p.value" statistic: The
##   package "cardx" (>= 0.2.1) is required.
Characteristic Member
N = 2,637,277
Casual
N = 1,329,510
p-value
Trip duration (mins)


    Mean (SD) 12 (9) 16 (12)
    Median (Q1 - Q3) 9 (5 - 15) 12 (7 - 21)
    Min - Max 1 - 55 1 - 55
# Boxplot 
p <- ggplot(trips_data_4, aes(x=member_casual, y=duration_trip, fill=member_casual)) + 
  stat_boxplot(geom ='errorbar' , width = 0.5) + 
  geom_boxplot(width = 0.5, outlier.colour="black", outlier.shape = 20, outlier.size=1)+
  labs(title="Trip Duration" , x="Subscription status", y = "Trip duration (mins)")
p + theme_classic()  + scale_fill_manual(name = "Subscription status", values=c("#136759","#8fed49")) + scale_color_manual(name = "Subscription status", values=c("#136759","#8fed49")) + 
  stat_summary(fun=mean, geom = "point", shape = 2, size = 2, color = "black") +
  annotate(geom = "text", x=2.3, y=50, label = "Wilcoxon rank \n sum test \n p<0.001")

Table 4: Compare the average ride duration by each day of the week for members vs casual users

trips_data_4 %>%  
  select(duration_trip,  week_day_trip, member_casual) %>% #selecting vars for summary
  tbl_strata( strata = week_day_trip, # Stratifying variable
              .tbl_fun = 
                ~ .x %>% 
                tbl_summary(
    by = c(member_casual), #grouping variable
    label = list(duration_trip = "Trip duration (mins)", member_casual = "Subscription status"),
    type = all_continuous() ~ "continuous2",
    statistic = list(all_continuous() ~ c("{mean} ({sd})", "{median} ({p25} - {p75})","{min} - {max}"))
  )  %>% 
  add_p(), #add p value for wilcoxon signed rank test
  .header = "**{strata}**, N = {n}"
  )
## The following errors were returned during `tbl_strata()`:
## ✖ For variable `duration_trip` (`member_casual`) and "p.value" statistic: The
##   package "cardx" (>= 0.2.1) is required.
## The following errors were returned during `tbl_strata()`:
## ✖ For variable `duration_trip` (`member_casual`) and "p.value" statistic: The
##   package "cardx" (>= 0.2.1) is required.
## The following errors were returned during `tbl_strata()`:
## ✖ For variable `duration_trip` (`member_casual`) and "p.value" statistic: The
##   package "cardx" (>= 0.2.1) is required.
## The following errors were returned during `tbl_strata()`:
## ✖ For variable `duration_trip` (`member_casual`) and "p.value" statistic: The
##   package "cardx" (>= 0.2.1) is required.
## The following errors were returned during `tbl_strata()`:
## ✖ For variable `duration_trip` (`member_casual`) and "p.value" statistic: The
##   package "cardx" (>= 0.2.1) is required.
## The following errors were returned during `tbl_strata()`:
## ✖ For variable `duration_trip` (`member_casual`) and "p.value" statistic: The
##   package "cardx" (>= 0.2.1) is required.
## The following errors were returned during `tbl_strata()`:
## ✖ For variable `duration_trip` (`member_casual`) and "p.value" statistic: The
##   package "cardx" (>= 0.2.1) is required.
Characteristic Sunday, N = 512155 Monday, N = 527302 Tuesday, N = 557795 Wednesday, N = 602352 Thursday, N = 579634 Friday, N = 572646 Saturday, N = 614903
Member
N = 288,591
Casual
N = 223,564
p-value Member
N = 376,567
Casual
N = 150,735
p-value Member
N = 411,476
Casual
N = 146,319
p-value Member
N = 435,797
Casual
N = 166,555
p-value Member
N = 415,366
Casual
N = 164,268
p-value Member
N = 374,790
Casual
N = 197,856
p-value Member
N = 334,690
Casual
N = 280,213
p-value
Trip duration (mins)




















    Mean (SD) 13 (10) 18 (12)
11 (9) 15 (12)
11 (9) 14 (11)
11 (9) 15 (11)
11 (9) 14 (11)
11 (9) 16 (12)
13 (10) 18 (12)
    Median (Q1 - Q3) 10 (6 - 17) 14 (8 - 24)
8 (5 - 14) 12 (7 - 20)
9 (5 - 14) 11 (7 - 19)
9 (5 - 15) 11 (7 - 19)
9 (5 - 14) 11 (7 - 19)
9 (5 - 14) 12 (7 - 21)
10 (6 - 17) 15 (9 - 24)
    Min - Max 1 - 55 1 - 55
1 - 55 1 - 55
1 - 55 1 - 55
1 - 55 1 - 55
1 - 55 1 - 55
1 - 55 1 - 55
1 - 55 1 - 55
#Calculating average duration for each week day between member and casual riders
t1 <- trips_data_4 %>% 
  group_by(member_casual, week_day_trip) %>%  #groups by subscription status and weekday
  summarise(number_of_rides = n()                           #calculates the number of rides and average duration 
            ,average_duration = mean(duration_trip),
            sd_duration = sd(duration_trip)) %>%        # calculates the average duration
  arrange(member_casual, week_day_trip) 
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
# Column Bar chart
#Plot of Average Trip Duration by Day of the Week Between Member vs Casual Riders

p <- ggplot(t1, aes(x=week_day_trip, y=average_duration, group=member_casual)) + 
  geom_line(aes(color=member_casual))+ 
  geom_point(aes(color=member_casual)) +
   #geom_errorbar(aes(y=), width=.2,
                # position=position_dodge(0.05)) +
  labs(title="Average Trip Duration by Day of the Week" , 
       x="Day of the Week", y = "Trip duration (mins)") +
  ylim(0,20)
p + theme_classic() + scale_color_manual(name = "Subscription status", values=c("#136759","#8fed49")) 

 #+   annotate(geom = "text", x=2.3, y=50, label = "Wilcoxon rank \n sum test \n p<0.001")

#Plot of Number of Rides by Day of the Week Between Member vs Casual Riders

p <- ggplot(t1, aes(x=week_day_trip, y=number_of_rides, fill=member_casual)) + 
  geom_col(position="dodge")+
  labs(title="Number of Rides by Day of the Week" , 
       x="Day of the Week", y = "Number of rides")
p + theme_classic() + scale_fill_manual(name = "Subscription status", values=c("#136759","#8fed49")) + 
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

Next, we will visualize the average number of rides taken at the specific time of the day.

#Calculating average number of rides between member and casual riders by time of the day
t2 <- trips_data_4 %>% 
  group_by(member_casual, hour = hour(time_trip)) %>%  #groups by subscription status and every hour
  mutate(number_of_rides = n()) %>% 
  ungroup()


p<-ggplot(t2, aes(x=time_trip, y=number_of_rides, group=member_casual)) +
  geom_line(aes(color=member_casual)) + theme_classic() #+ geom_point(aes(color=member_casual))

breaks <- c(0, 6,12,18,24)*3600
labels <- format(as.POSIXct(breaks, tz = "UTC"), format = "%H:%M")

p + scale_color_manual(name = "Subscription status", values=c("#136759","#8fed49")) + 
  labs(title="Number of Rides by Time of the Day" , 
       x="Time of the Day", y = "Number of rides") +scale_x_time(breaks = breaks, labels = labels)

In the following analysis, we will visualize the most frequent start and end stations based on the number of rides, and explore how this distribution varies between member and casual riders.

Plot x: Visualization for the most common start stations by subscription status

stn_name_final <- trips_data_4 %>% 
  group_by(start_station_id) %>%  #groups by start_station_id
  mutate(start_station_name_new = names(which.max(table(start_station_name)))) %>% 
  ungroup() %>% 
 group_by(end_station_id) %>%  #groups by start_station_id
  mutate(end_station_name_new = names(which.max(table(end_station_name)))) 


#Getting the one set of coordinate for each station id
station_info <- stn_name_final %>% 
   dplyr::group_by(start_station_id, start_station_name_new) %>%  #groups by start_station_id
  mutate (start_lat_final = mean(start_lat), start_lng_final = mean(start_lng)) %>% 
  dplyr::ungroup() %>% 
   dplyr::group_by(end_station_id, end_station_name_new) %>%  #groups by end_station_id
    mutate (end_lat_final = mean(end_lat), end_lng_final = mean(end_lng)) %>% 
  dplyr::ungroup()

#station_data <- trips_data_4 %>% 
# distinct(start_station_name, start_station_id)

#station_dup <- station_data[duplicated(station_data$start_station_id), ]

# calculating the number of rides for each start station taken by member and casual riders
start_station_unique <- station_info %>% #filter(member_casual == "Casual") %>% 
  group_by(member_casual, start_station_id, start_station_name_new, start_lat_final, start_lng_final) %>%  
  summarise(number_of_rides = n())  %>% #calculates the number of rides 
  ungroup()

end_station_unique <- station_info %>%  #filter(member_casual == "Casual") %>% 
  group_by(member_casual, end_station_id, end_station_name_new, end_lat_final, end_lng_final) %>%  
  summarise(number_of_rides = n())  %>%     #calculates the number of rides 
  ungroup()

start_station_unique <- start_station_unique %>% 
  mutate(numBins = ifelse(number_of_rides <= 10000, number_of_rides,10000))

start_station_unique <- start_station_unique %>% 
  mutate(numBins = case_when(number_of_rides <= 2500 ~ "<2500",
                             number_of_rides > 2500 & number_of_rides <= 7500 ~ "2500 - 7500",
                             number_of_rides > 7500 & number_of_rides <= 10000 ~ "7500 - 10000",
                             number_of_rides > 10000 ~ ">10000"))                          

  
#install.packages("plotly")
library(plotly)
library(sf)
shape <- read_sf(dsn = "C:/Users/Smrut/OneDrive/Documents/Raw Data/Data", 
                 layer = "geo_export_36491657-9a1b-46a0-a113-872487cb2834")
p <- ggplot(shape) +
  geom_sf(fill = "grey", color = "white")

# Visualization for most common start points for rides taken by member vs casual riders

q <- p + geom_point(data=start_station_unique ,aes(x=start_lng_final,y=start_lat_final, color=numBins, 
                                             text = paste("Start station:", start_station_name_new, "\n", "No. of rides:", number_of_rides)),size=1,alpha=0.75) + xlab("Longitude") + ylab("Latitude") + ggtitle("Distribution of Starting Stations with Number of Rides in Chicago") + 
  facet_grid(. ~ member_casual, axes = "all") + 
    scale_colour_brewer(palette = "RdYlGn", direction = -1, name = "Number of Rides")

ggplotly(q, source = "select", tooltip = c("text"))

Plot x: Visualization for the most common end stations by subscription status

# Visualization for most common start points for rides taken by member vs casual riders

q <- p + geom_point(data=end_station_unique ,aes(x=end_lng_final,y=end_lat_final, color=number_of_rides, 
                                             text = paste("End station:", end_station_name_new, "\n", "No. of rides:", number_of_rides)),size=1,alpha=0.75) + xlab("Longitude") + ylab("Latitude") + ggtitle("Distribution of Ending Stations with Number of Rides in Chicago") + 
  facet_grid(. ~ member_casual, axes = "all") +
  scale_colour_gradient(high="red",low='green', name = "Number of Rides") 

ggplotly(q, source = "select", tooltip = c("text"))

### SHARE

ACT